Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests through a deterioration of movement, including tremors and stiffness. Speech is commonly affected as well, with dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Since there is no definitive laboratory test for PD, diagnosis is often difficult, particularly in the early stages when motor symptoms are not yet severe. Because PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive diagnostic tool. If machine learning algorithms could be applied to a voice-recording dataset to accurately diagnose PD, this would provide an effective screening step prior to an appointment with a clinician.
spread1, spread2, PPE - three nonlinear measures of fundamental frequency variation.
name: string (unique for each instance)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import recall_score,precision_score,confusion_matrix,classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
# Ensemble methods (random forest, AdaBoost, bagging) are imported above
df=pd.read_csv('Data - Parkinsons.csv')
df.head(10)
df.tail(10)
df.dtypes
df.shape
print(df.isna().sum().sum())  # isnull() is an alias of isna(), so one check is enough
No columns contain null values.
df.describe().transpose()
adf= df.drop('name',axis=1)
adf
plt.figure(figsize=(30, 50))  # set the figure size
pos = 1  # position of the subplot in the overall grid
for feature in adf.columns:  # iterate over every attribute to visualize its distribution
    plt.subplot(8, 3, pos)  # 8x3 plot grid
    if feature != 'status':  # histogram for the continuous columns
        sns.histplot(df[feature], kde=True)  # distplot is deprecated in recent seaborn
    else:  # bar chart for the categorical target column
        sns.countplot(x=df[feature], palette='jet_r')
    pos += 1  # move to the next cell of the grid
Observations:
- Attribute "MDVP:Fo(Hz)" is slightly skewed.
- Attribute "MDVP:Fhi(Hz)" is highly right skewed; most outliers lie on the right side.
- Attribute "MDVP:Flo(Hz)" is right skewed but has fewer outliers.
- Attributes "MDVP:Jitter(%)", "MDVP:PPQ", "Jitter:DDP", "MDVP:Shimmer", "MDVP:Shimmer(dB)", "Shimmer:APQ3", "Shimmer:APQ5", "MDVP:APQ", "Shimmer:DDA", "NHR", "MDVP:Jitter(Abs)" and "MDVP:RAP" are highly right skewed, with long right tails and many outliers.
- Attributes "HNR", "RPDE", "D2", "DFA", "spread1", "spread2" and "PPE" are fairly normally distributed.
- The target attribute "status" is imbalanced: more samples have Parkinson's disease than not.
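The skew judgments above can be checked numerically with pandas' `DataFrame.skew()` (on the real data this would simply be `adf.skew()`); a self-contained sketch on synthetic stand-in columns:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: a long-right-tailed column (like the jitter/shimmer
# measures) and a roughly symmetric one (like HNR). The real call is adf.skew().
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'jitter_like': rng.lognormal(mean=-4, sigma=0.8, size=195),  # right skewed
    'hnr_like': rng.normal(loc=22, scale=4, size=195),           # ~symmetric
})
print(demo.skew())  # positive value => right skew, near zero => symmetric
```

A clearly positive skew coefficient confirms a long right tail, while values near zero back the "quite normally distributed" calls.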
adf['status'].value_counts()
Out of 195 samples, 145 are labeled as PD (status = 1), confirming the class imbalance.
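Because the classes are imbalanced, a stratified split keeps the PD/healthy ratio the same in the train and test sets; the splits later in this notebook do not stratify, but it is a one-argument change. A sketch with dummy labels matching the 145/50 counts above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 145 PD (status=1) and 50 healthy (status=0) samples, as counted above
y_demo = np.array([1] * 145 + [0] * 50)
X_demo = np.arange(195).reshape(-1, 1)  # dummy single-feature matrix

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.30,
                                      random_state=11, stratify=y_demo)
# The class-1 fraction is (almost) identical in both splits
print(ytr.mean(), yte.mean())
```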
sns.pairplot(adf, hue = 'status', diag_kind='kde',height=3 )
plt.show()
sns.set(style="white")
# Compute the correlation matrix
corr = adf.corr()
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 18))
# Draw the heatmap with the correct aspect ratio
sns.heatmap(corr, annot=True, cmap='YlGnBu', vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Observations
- MDVP:Jitter(%) has a high correlation with every attribute except spread2, RPDE and status.
- MDVP:Jitter(%) has a correlation above 95% with MDVP:RAP, MDVP:PPQ, MDVP:Jitter(Abs) and Jitter:DDP.
- MDVP:Shimmer has a correlation above 95% with MDVP:Shimmer(dB), Shimmer:DDA, Shimmer:APQ3, Shimmer:APQ5 and MDVP:APQ.
- status and PPE have a relatively high correlation.
- spread1 and PPE have a correlation of 96%.
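The above-95% pairs can be found programmatically by scanning the upper triangle of the correlation matrix instead of reading the heatmap; a minimal sketch on a small synthetic frame (on the real data, replace `df_demo` with `adf`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=100)
df_demo = pd.DataFrame({'a': a,
                        'b': a + rng.normal(scale=0.01, size=100),  # near-copy of a
                        'c': rng.normal(size=100)})                 # independent

corr = df_demo.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c) for c in upper.columns for r in upper.index if upper.loc[r, c] > 0.95]
print(pairs)  # → [('a', 'b')]
```

This yields exactly the pairs that motivate the column drops in the next cell.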
reg= adf.drop(columns=['MDVP:RAP','MDVP:PPQ','MDVP:Jitter(Abs)','Jitter:DDP','MDVP:Shimmer(dB)' ,
'Shimmer:DDA','Shimmer:APQ3','Shimmer:APQ5', 'MDVP:APQ'])
reg
col_x = reg.drop(columns='status') # Predictors
col_y = reg['status'] # target
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
X_train, X_test, y_train, y_test = train_test_split(col_x, col_y, test_size=0.30,random_state=20)
sc=StandardScaler()
scaledX_train = sc.fit_transform(X_train)
scaledX_test = sc.transform(X_test)
#We will build KNN model with n_neighbors = 3
knn = KNeighborsClassifier(n_neighbors=3)
# Build step forward feature selection
sfs1 = sfs(knn, k_features=13, forward=True, scoring='f1', cv=5)
# Perform step forward feature selection
sfs1 = sfs1.fit(scaledX_train, y_train)
sfs1.get_metric_dict()
- With all 13 features, the average F1 score is 91%.
- Reducing the number of features increases the average F1 score, which suggests that instead of using every feature we can build the model on the important features only.
- From the SFFS results above, the feature subset (0, 2, 4, 6, 7, 9, 10, 11, 12) gives the maximum F1 score of 94.3%.
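`get_metric_dict()` returns, for each subset size, the chosen feature indices and the average cross-validated score, so the best subset can be picked in one line. Shown here on a mocked-up dict carrying the two scores reported above (`sfs1.get_metric_dict()` has the same shape, keyed by subset size):

```python
# Mocked metric dict: 91% average F1 with all 13 features,
# 94.3% with the 9-feature subset, as reported above.
metric_dict = {
    13: {'feature_idx': tuple(range(13)), 'avg_score': 0.910},
    9:  {'feature_idx': (0, 2, 4, 6, 7, 9, 10, 11, 12), 'avg_score': 0.943},
}
best = max(metric_dict.values(), key=lambda d: d['avg_score'])
print(best['feature_idx'])  # → (0, 2, 4, 6, 7, 9, 10, 11, 12)
```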
l = col_x.columns
l[[0, 2, 4, 6, 7, 9, 10, 11, 12]]
neighbors = np.arange(1, 19)
train_accuracy_plot = np.empty(len(neighbors))
test_accuracy_plot = np.empty(len(neighbors))
X=reg[['MDVP:Fo(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Shimmer', 'HNR', 'RPDE', 'spread1',
'spread2', 'D2', 'PPE']]
y=reg['status']
# Loop over different values of k
for i, k in enumerate(neighbors):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=11)
    sc = StandardScaler()
    scaledX_train = sc.fit_transform(X_train)
    scaledX_test = sc.transform(X_test)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(scaledX_train, y_train)
    # Compute accuracy on the training set
    train_accuracy_plot[i] = knn.score(scaledX_train, y_train)
    # Compute accuracy on the testing set
    test_accuracy_plot[i] = knn.score(scaledX_test, y_test)
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy_plot, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy_plot, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
We can see here that when k = 6 , the testing accuracy is highest.
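A single train/test split can make the "best" k sensitive to the random seed; a cross-validated grid search averages the score over folds instead. A sketch on synthetic data of the same shape (on the real data, `grid` would be fit on the 9-feature `X` and `y`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the 9-feature design matrix
X_demo, y_demo = make_classification(n_samples=195, n_features=9, random_state=0)

# Scaling inside the pipeline avoids leaking test-fold statistics into the fit
pipe = Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])
grid = GridSearchCV(pipe, {'knn__n_neighbors': list(range(1, 19))},
                    cv=5, scoring='accuracy')
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```

Putting the scaler in the pipeline also means each CV fold is scaled with only its own training folds, mirroring the fit/transform discipline used elsewhere in this notebook.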
X=reg[['MDVP:Fo(Hz)', 'MDVP:Flo(Hz)', 'MDVP:Shimmer', 'HNR', 'RPDE', 'spread1',
'spread2', 'D2', 'PPE']]
y=reg['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,random_state=11)
sc=StandardScaler()
scaledX_train = sc.fit_transform(X_train)
scaledX_test = sc.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(scaledX_train,y_train)
print('Accuracy score for training set :', knn.score(scaledX_train,y_train)*100)
print('Accuracy score for testing set :', knn.score(scaledX_test,y_test)*100)
pred=knn.predict(scaledX_test)
print('Confusion Matrix : ')
print(confusion_matrix(y_test,pred))
print('classification_report : ')
print(classification_report(y_test,pred))
print('Recall Score : ', recall_score(y_test,pred))
print('Precision Score : ', precision_score(y_test,pred))
X_train, X_test, y_train, y_test = train_test_split(col_x, col_y, test_size=0.30,random_state=20)
sc=StandardScaler()
scaledX_train = sc.fit_transform(X_train)
scaledX_test = sc.transform(X_test)
logr = LogisticRegression()
sfs1 = sfs(logr, k_features=12, forward=True, scoring='f1', cv=5)
sfs1 = sfs1.fit(scaledX_train, y_train)
sfs1.get_metric_dict()
From the SFFS results above, the feature subset (2, 3, 5, 7, 9, 11, 12) gives the maximum F1 score of 93.97%.
l = col_x.columns
l[[2, 3, 5, 7, 9, 11, 12]]
X = reg[['MDVP:Flo(Hz)', 'MDVP:Jitter(%)', 'NHR', 'RPDE', 'spread1', 'D2',
'PPE']]
y = reg['status']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30,random_state=25)
sc=StandardScaler()
scaledX_train = sc.fit_transform(X_train)
scaledX_test = sc.transform(X_test)
#create an instance and fit the model
logr = LogisticRegression()
logr.fit(scaledX_train, y_train)
#predictions
log_pred = logr.predict(scaledX_test)
print('Accuracy score for training set :', logr.score(scaledX_train,y_train)*100)
print('Accuracy score for testing set :', logr.score(scaledX_test,y_test)*100)
log_pred=logr.predict(scaledX_test)
print('Confusion Matrix : ')
print(confusion_matrix(y_test,log_pred))
print(classification_report(y_test,log_pred))
print('Recall Score : ', recall_score(y_test,log_pred))
print('Precision Score : ', precision_score(y_test,log_pred))
The logistic regression model above has a test accuracy of 86.44%, a train accuracy of 88.97%, and an F1 score of 91%.
X_train, X_test, y_train, y_test = train_test_split(col_x, col_y, test_size=0.30,random_state=20)
sc=StandardScaler()
scaledX_train = sc.fit_transform(X_train)
scaledX_test = sc.transform(X_test)
gb = GaussianNB()
sfs1 = sfs(gb, k_features=12, forward=True, scoring='f1', cv=5)
sfs1 = sfs1.fit(scaledX_train, y_train)
sfs1.get_metric_dict()
l=col_x.columns
l[[0,6, 9, 11]]
# Splitting the data
X = reg[['MDVP:Fo(Hz)', 'HNR', 'spread1', 'D2']]
y = reg['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)
# Re-scale the newly split data; otherwise the model would be fit on the stale
# 13-feature scaled arrays left over from the feature-selection cell above
sc = StandardScaler()
scaledX_train = sc.fit_transform(X_train)
scaledX_test = sc.transform(X_test)
gb = GaussianNB()
gb.fit(scaledX_train, y_train)
# making predictions
gb_pred = gb.predict(scaledX_test)
print('Accuracy score for training set :', gb.score(scaledX_train,y_train)*100)
print('Accuracy score for testing set :', gb.score(scaledX_test,y_test)*100)
print('Confusion Matrix : ')
print(confusion_matrix(y_test,gb_pred))
print(classification_report(y_test,gb_pred))
print('Recall Score : ', recall_score(y_test,gb_pred))
print('Precision Score : ', precision_score(y_test,gb_pred))
The Naive Bayes model above has a test accuracy of 69.49%, a train accuracy of 77.94%, and an F1 score of 75%.
rf= RandomForestClassifier(n_estimators = 50)
rf = rf.fit(X_train, y_train)
pred_RF = rf.predict(X_test)
acc_RF = metrics.accuracy_score(y_test, pred_RF)
acc_RF
Rdf = pd.DataFrame({'Method':['Random Forest'], 'accuracy': [acc_RF]})
resultsDf = Rdf[['Method', 'accuracy']]
resultsDf
abcl = AdaBoostClassifier( n_estimators= 100, learning_rate=0.1, random_state=22)
abcl = abcl.fit(X_train, y_train)
pred_AB =abcl.predict(X_test)
acc_AB = metrics.accuracy_score(y_test, pred_AB)
tempResultsDf = pd.DataFrame({'Method':['Adaboost'], 'accuracy': [acc_AB]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
AdaBoost performs better than Random Forest.
bgcl = BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22)
bgcl = bgcl.fit(X_train, y_train)
pred_BG =bgcl.predict(X_test)
acc_BG = metrics.accuracy_score(y_test, pred_BG)
tempResultsDf = pd.DataFrame({'Method':['Bagging'], 'accuracy': [acc_BG]})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy']]
resultsDf
Bagging has a lower accuracy score than both Random Forest and AdaBoost.
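Since the bagging model above was built with `oob_score=True`, it also carries an out-of-bag accuracy estimate, computed on the samples left out of each bootstrap, effectively a free validation score. A sketch on synthetic data (on the real model this is just `bgcl.oob_score_`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
bag = BaggingClassifier(n_estimators=50, max_samples=0.7, bootstrap=True,
                        oob_score=True, random_state=22)
bag.fit(X_demo, y_demo)
print(round(bag.oob_score_, 3))  # out-of-bag accuracy estimate
```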
from mlxtend.classifier import StackingClassifier
from sklearn import datasets
from sklearn import model_selection
clf1= KNeighborsClassifier(n_neighbors=6)
clf2=RandomForestClassifier(n_estimators = 50)
lr= LogisticRegression()
sclf=StackingClassifier(classifiers=[clf1,clf2],use_probas=True,meta_classifier=lr)
for clf, label in zip([clf1, clf2, sclf], ['KNN', 'RandomForest', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, col_x, col_y, cv=3, scoring='f1_macro')
    print("f1 scores : %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
In the stacking (meta-classifier) model, the final F1 score is 0.69.
KNN Model - The k-NN model with n_neighbors = 6 has a test accuracy of 96.6%, a train accuracy of 93.38%, and an F1 score of 98%.
Logistic Regression - With SFFS feature selection, the maximum F1 score is 93.97%.
Naive Bayes - The Naive Bayes model has a test accuracy of 69.49%, a train accuracy of 77.94%, and an F1 score of 75%.
Random Forest - The random forest model has an accuracy score of 0.830508.
!jupyter nbconvert --to html Untitled.ipynb